Abstract

An intermediate response measure that accurately predicts efficacy in a new setting can 
reduce trial cost and time to product licensure. In this paper, we define a trial level general 
surrogate as a trial level intermediate response that accurately predicts trial level clinical responses.
Methods for evaluating trial level general surrogates have been developed previously.
Many methods in the literature use trial level intermediate responses for prediction. However, 
all existing methods focus on surrogate evaluation and prediction in new settings, rather than 
comparison of candidate trial level surrogates, and few formalize the use of cross validation 
to quantify the expected prediction error. Our proposed method uses Bayesian non-parametric 
modeling and cross-validation to estimate the absolute prediction error for use in evaluating and 
comparing candidate trial level general surrogates. Simulations show that our method performs 
well across a variety of scenarios. We use our method to evaluate and to compare candidate 
trial level general surrogates in several multi-national trials of a pentavalent rotavirus vaccine. 
We identify two immune measures that have potential value as trial level general surrogates and 
use the measures to predict efficacy in a trial with no clinical outcomes measured. 

Bayesian non-parametrics; Cross-Validation; Meta-analysis; Surrogate markers; Vaccines 
1 Introduction


Many different definitions of a surrogate exist. The most common definition is an intermediate 
response measurement that accurately predicts the clinical endpoint of interest. This definition lacks 
some necessary details about what ‘accurately predicts’ means and in what setting this prediction 
is done, but it is a useful starting point. Gilbert et al. [2008] refined the concept of a surrogate in
the vaccine trial setting, proposing three levels: correlates, specific surrogates of protection, and
general surrogates of protection. A correlate is a biomarker that is associated with the outcome
in the vaccine arm or over both arms, and a specific surrogate of protection is a biomarker that is
associated with efficacy. The definition of a general surrogate of protection is most closely related 
to the common definition of a surrogate: a biomarker on which the treatment effect can be used 
to accurately predict the treatment effect on the clinical outcome in a new setting. These levels 
are informative for general clinical trials as well. To make clear that we are considering the more 
general clinical trial setting, we will call biomarkers that are predictive of the clinical treatment 
effect in a new setting general surrogates (GS). Numerous papers have investigated the evaluation
of biomarkers as GS, including Daniels and Hughes [1997], Gail et al. [2000], Burzykowski et al. [2001],
and Dai and Hughes [2012].

The definition of a GS in Gilbert et al. [2008] does not define in what new trial settings the GS
will be useful or if the surrogate is at the individual or trial level. An individual level surrogate is 
a measurement the treatment effect on which is predictive of the clinical effect at the individual 
level. The effect of treatment on a trial level surrogate can be used to infer what the trial results 
would have been had the clinical outcome been measured. A GS can be predictive at the trial level 
or the individual level, or both. As is outlined in Korn et al. [2005], these two types of surrogacy
do not imply each other and can be unrelated. Most of the literature focuses on the evaluation of a 
potential trial level general surrogate in studies in which both the GS and the clinical outcome are 
measured, yet little attention is given to what information is available to support the generalizability 
of the association of the surrogate and the clinical endpoint to a new setting. All existing general 
surrogate evaluation methods, to our knowledge, only consider summary measures for surrogate 
evaluation rather than comparison of candidate GS, and most are based on within sample prediction 
intervals. Only Baker [2006] attempts to formalize the exogenous quantification of the expected
error when predicting in a new setting, using a similar summary of prediction error to our suggested 
method for a binary surrogate. In addition, most existing methods are specific for the type of data 
collected and require strict modeling assumptions. Dai and Hughes [2012] propose a general method
for evaluating a trial level GS, but they require individual level data be available in all trials and 
suggest a within sample evaluation of prediction error based on a linear relationship between the 
treatment effects. 

In this paper, we provide a precise definition of a trial level general surrogate. We propose a 
general and flexible evaluation and prediction method that differs from existing meta-analytic evaluation
methods in several ways. Our suggested estimation method can be used on individual level
or trial level data, similar to Daniels and Hughes [1997] (DH). It allows for a flexible association
between the treatment effect on the trial level general surrogate and the treatment effect on the 
clinical outcome and a flexible distribution for both true treatment effects over the trials. Our proposed
evaluation method uses a Bayesian cross-validation approach to quantify the prediction error
when estimating the treatment effect in a new setting. We propose the use of absolute prediction
error, similar to that of Tian et al. [2007], as a summary error for evaluation and comparison of
candidate trial level general surrogates. The absolute prediction error is easily interpretable because
it is on the scale of the treatment effect on the clinical outcome. We also outline a suggested 
nomenclature for classifying support for the generalizability of an evaluated trial level general surrogate.
We build on the work of Daniels and Hughes [1997], taking a Bayesian approach to trial
level general surrogate evaluation while allowing for the within trial estimation approach of Dai
and Hughes [2012] when individual level data are available. Software implementing our method in
R and JAGS is available in the supplementary materials. 

We use our method to evaluate and compare trial level general surrogates for the pentavalent 
rotavirus vaccine RotaTeq™ (RV5) (Merck & Co. Inc., Kenilworth, New Jersey). RV5 has been
shown to be efficacious against rotavirus gastroenteritis in many settings, but in developing countries 
where the disease burden is highest, this efficacy is lower than in developed nations. A universal 
and clear measure of post-randomization immune response that explains the efficacy differences 
among the trials and that could potentially accurately predict efficacy in future settings has yet to 
be clearly identified. We use our approach to attempt to identify trial level general surrogates in 
the setting of the RV5 trials. 

In Section 2 we outline our proposed methods for estimation, prediction, evaluation and comparison.
In Section 3 we explore the operating characteristics of our methods in several settings,
one of which follows closely the setting of the RV5 trials. In Section 4 we outline our suggested
nomenclature for the generalizability of a TLGS. In Section 5 we use our method to investigate
TLGS of rotavirus gastroenteritis of any severity in the RV5 trials, and in Section 6 we summarize
our findings and suggest possible extensions to the proposed methods.


2 Methods 


We refine the definition of a GS from Gilbert et al. [2008] to indicate whether the prediction is at
the individual or trial level and the characteristics of the new setting where such a GS might be 
used. If the trial level treatment effect on a biomarker can be used to predict (with low prediction 
error) a trial level clinical treatment effect and this predictive association is generalizable to a new 
setting, then the biomarker is a trial level general surrogate (TLGS) for that clinical outcome in 
that new setting. 

This definition clearly states that the surrogate is at the trial level and indicates the settings 
where the TLGS can be used. 

To outline our proposed methods for evaluation and comparison of biomarkers as TLGS, we 
first define some notation. For subject $i \in \{1, \ldots, N_j\}$ in trial $j \in \{1, \ldots, J\}$, let $Z_{i,j} \in \{0, 1\}$
be the treatment indicator, $Y_{i,j}$ the clinical endpoint (the same over all trials) and $A_{i,j,k}$ the $k$th
biomarker measure, $k \in \{1, \ldots, K\}$. At the trial level, let $T_{1,j,k}$ be the true treatment effect on
the $k$th biomarker measurement and let $T_{2,j}$ be the true treatment effect on the clinical outcome.
Let $N_{j,k}$ and $N_{j,2}$ be the set of subjects with the $k$th candidate TLGS and the clinical outcome
measured in trial $j$, respectively.
2.1 Within Trial Models


2.1.1 Models for Subject Level Data 

When subject level data are available in one or more trials, we specify the following linear models 
for the (transformed) mean, median, or hazard for the candidate TLGS and the clinical outcomes, 

$h_{1,k}(F[A_{i,j,k} \mid Z_{i,j}]) = \zeta_{0,j,k} + T_{1,j,k} Z_{i,j}$, (1)

$h_2(F[Y_{i,j} \mid Z_{i,j}]) = \gamma_{0,j} + T_{2,j} Z_{i,j}$, (2)


where $\zeta_{0,j,k}$ and $\gamma_{0,j}$ denote the $j$th trial specific intercepts for the $k$th biomarker measurement
and the clinical outcome, respectively. Models (1) and (2) are specified in terms of the distribution
function, $F(\cdot)$, of $A$ and $Y$ and can be any set of models from which asymptotically normal coefficient
estimates are obtained, including generalized linear models and survival models. We fit the models
separately for each trial and response (outcome or biomarker). For this reason, $N_{j,k}$ and $N_{j,2}$, the
subjects in each trial used to fit (1) and (2) respectively, do not need to have both the outcome
and biomarkers provided the (implicit) missingness is random and/or by design. We denote the
estimates of the treatment effects as $\hat T_{1,j,k}$ and $\hat T_{2,j}$ with standard errors $\hat\sigma_{1,j,k}$ and $\hat\sigma_{2,j}$,
$j \in \{1, \ldots, J\}$.
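As a concrete illustration of the within-trial step, the following minimal sketch estimates a treatment effect and its standard error in a single trial for the simplest special case of models (1) and (2): an identity link on the mean, where the treatment effect is the difference in arm means. The function name `treatment_effect` and the unpooled standard error are illustrative assumptions; they are not taken from the supplementary software, and other links (for example log hazard) would need a different estimator.

```python
import math
from statistics import mean, variance


def treatment_effect(response, treated):
    """Difference in arm means and its standard error (identity-link case).

    response: list of numeric responses (biomarker or clinical outcome)
    treated:  list of 0/1 treatment indicators, same length as response
    """
    arm1 = [y for y, z in zip(response, treated) if z == 1]
    arm0 = [y for y, z in zip(response, treated) if z == 0]
    if len(arm1) < 2 or len(arm0) < 2:
        raise ValueError("need at least two subjects per arm")
    effect = mean(arm1) - mean(arm0)
    # Unpooled standard error of the difference in means (an assumption).
    se = math.sqrt(variance(arm1) / len(arm1) + variance(arm0) / len(arm0))
    return effect, se
```

Applied separately to the biomarker and to the clinical outcome in each trial, a function of this kind would supply the pairs of estimates and standard errors that make up the observed 'data' vector defined next.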

Let $O_{J,k}$ be the vector of estimated treatment effects and estimated standard errors over all
$J$ trials for the $k$th candidate GS; $O_{J,k} = (\hat T_{1,1,k} \ldots \hat T_{1,J,k}, \hat T_{2,1} \ldots \hat T_{2,J}, \hat\sigma_{1,1,k} \ldots \hat\sigma_{1,J,k}, \hat\sigma_{2,1} \ldots \hat\sigma_{2,J})$
and define $O_J$ to be the vector of all estimated treatment effects and standard errors over all $K$
candidate GS and clinical outcome data in a given set of $J$ trials. We will call $O_J$ the vector of
observed 'data'. Also define $T_2 = (T_{2,1} \ldots T_{2,J})$ and $T_{1,k} = (T_{1,1,k} \ldots T_{1,J,k})$.

Dai and Hughes [2012] use an estimating equation approach to estimate the joint sampling
distribution of the estimated treatment effects from (1) and (2). Using their approach the correlation
between the estimated treatment effects can be estimated. However, an estimate of this covariance is
not required to evaluate the candidate TLGS [Daniels and Hughes, 1997], and complex missingness
by design can make this covariance difficult to estimate.


2.1.2 Approximate Likelihood 

We assume that the estimated treatment effects are consistent and asymptotically normal and can 
be well approximated by 


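$\hat T_{1,j,k} \sim N(T_{1,j,k}, \hat\sigma^2_{1,j,k})$ and $\hat T_{2,j} \sim N(T_{2,j}, \hat\sigma^2_{2,j})$.

Note that the estimated treatment effects can be obtained by fitting models (1) and (2) or from
the literature. We denote these models for the estimated treatment effects given the true treatment
effects by $f_{1,j,k}(\hat T_{1,j,k} \mid T_{1,j,k}, \hat\sigma_{1,j,k})$ and $f_{2,j}(\hat T_{2,j} \mid T_{2,j}, \hat\sigma_{2,j})$. The joint distribution of the estimated
treatment effects is given by

$\hat T_{1,j,k}, \hat T_{2,j} \mid T_{1,j,k}, T_{2,j}, \hat\sigma_{1,j,k}, \hat\sigma_{2,j} \sim f_{1,j,k}(\hat T_{1,j,k} \mid T_{1,j,k}, \hat\sigma_{1,j,k}) \times f_{2,j}(\hat T_{2,j} \mid T_{2,j}, \hat\sigma_{2,j})$, (3)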
where we assume independence conditional on the true treatment effects. Based on these (approximate)
trial specific models, the likelihood over all the trials for the $k$th TLGS is given by
$\prod_{j=1}^{J} f_{1,j,k}(\hat T_{1,j,k} \mid T_{1,j,k}, \hat\sigma_{1,j,k}) \, f_{2,j}(\hat T_{2,j} \mid T_{2,j}, \hat\sigma_{2,j})$, as the estimates between trials are assumed to be independent.
In what follows, the estimated standard errors $\hat\sigma_{1,j,k}$ and $\hat\sigma_{2,j}$ will be treated as fixed.

When the individual level data are available, we can estimate the covariance of $(\hat T_{1,j,k}, \hat T_{2,j})$ and
replace the two independent normals with a bivariate normal. One could also use the individual
data directly by using models (1) and (2) to form the likelihood. However, such a formulation is
unlikely to substantially improve evaluation of GS at the trial level, as was demonstrated in Korn
et al. [2005]. Therefore, we focus on a two step approach: first, estimating the treatment effects
within each of a set of trials, and second, constructing an approximate likelihood based on those
estimates.
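The working-independence likelihood described in this section can be sketched as a sum of normal log densities over trials. The sketch below is a minimal illustration under the stated normal approximation, with standard errors treated as fixed; the function names `normal_logpdf` and `approx_loglik` are hypothetical and do not refer to the supplementary R and JAGS code.

```python
import math


def normal_logpdf(x, mu, sd):
    """Log density of N(mu, sd^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * ((x - mu) / sd) ** 2


def approx_loglik(t1_hat, s1, t2_hat, s2, t1_true, t2_true):
    """Working-independence log-likelihood over J trials for one candidate.

    For each trial, adds the log of two independent normal densities: one for
    the estimated biomarker effect and one for the estimated clinical effect,
    each centred at the corresponding true effect.
    """
    total = 0.0
    for a, sa, b, sb, ta, tb in zip(t1_hat, s1, t2_hat, s2, t1_true, t2_true):
        total += normal_logpdf(a, ta, sa) + normal_logpdf(b, tb, sb)
    return total
```

Replacing the two independent terms with a bivariate normal log density would give the alternative formulation that uses an estimated within-trial covariance.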


2.2 Between Trial Model and Priors 


We specify a non-parametric prior for the distribution of the true treatment effect on the candidate
GS, $T_{1,j,k}$. In particular we propose a Dirichlet process mixture of normals [MacEachern and Müller,
1998],

$T_{1,j,k} \sim N(\mu_{1,j,k}, \tau_{1,j,k})$
$(\mu_{1,j,k}, \tau_{1,j,k}) \sim G$
$G \sim DP(\omega, G_0)$ (4)


For $\omega$, we specify a $U(1, J)$ prior. We use this prior both to bound $\omega$ away from zero and to
allow for clustering in the true trial level biomarker effects, $T_{1,j,k}$. We specify the base measure
$G_0$ to be the product of a normal and a gamma distribution where each $\mu_{1,j,k}$ follows a normal
distribution with mean $\phi_{1,k}$ and variance $\sigma^2_{\phi_{1,k}}$ and each $1/\tau_{1,j,k}$ follows a gamma distribution with
parameters $(\eta_{1,k}, \xi_{1,k})$. Denote the vectors of trial level parameters as $\mu_{1k} = (\mu_{1,1,k}, \ldots, \mu_{1,J,k})$
and $\tau_{1k} = (\tau_{1,1,k}, \ldots, \tau_{1,J,k})$. We use data-dependent hyper-priors for the parameters in the base
measure as recommended in Taddy [2008].
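To make the Dirichlet process mixture prior more tangible, the sketch below draws trial level biomarker effects from a truncated stick-breaking approximation. This is only an illustration of the prior's clustering behaviour: the truncation at `n_atoms`, the fixed base-measure values, and the function name `draw_dp_mixture_effects` are all assumptions made for this example, and the paper's own fitting is done in JAGS with data-dependent hyper-priors.

```python
import random


def draw_dp_mixture_effects(n_trials, omega, base_mean, base_sd,
                            gamma_shape, gamma_rate, n_atoms=50, rng=None):
    """Draw trial-level effects from a truncated DP mixture of normals."""
    rng = rng or random.Random()
    # Stick-breaking weights: w_h = v_h * prod_{l<h}(1 - v_l), v_h ~ Beta(1, omega).
    weights, remaining = [], 1.0
    for _ in range(n_atoms - 1):
        v = rng.betavariate(1.0, omega)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)  # last atom absorbs the leftover mass
    # Atoms (mean, variance) drawn from the base measure G0.
    atoms = []
    for _ in range(n_atoms):
        mu = rng.gauss(base_mean, base_sd)
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        precision = rng.gammavariate(gamma_shape, 1.0 / gamma_rate)
        atoms.append((mu, 1.0 / precision))
    # Each trial picks an atom, then draws its effect from that normal.
    effects = []
    for _ in range(n_trials):
        mu, tau = rng.choices(atoms, weights=weights, k=1)[0]
        effects.append(rng.gauss(mu, tau ** 0.5))
    return effects, weights
```

Smaller values of `omega` concentrate the weights on a few atoms, so more trials share the same component; larger values spread the mass and approach independent draws from the base measure.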


We now specify a flexible model for the true trial level clinical treatment effect given the true
trial level treatment effect on the TLGS. Specifically, $T_{2,j} = m(T_{1,j,k}, \beta_k, b_k) + \epsilon_{j,k}$,
where $\epsilon_{j,k}$ are independent $N(0, \sigma^2)$. The function $m(T_{1,j,k}, \beta_k, b_k)$ is defined as:

$m(T_{1,j,k}, \beta_k, b_k) = \beta_{0,k} + \beta_{1,k} T_{1,j,k} + \sum_{m=1}^{M} b_{m,k} |T_{1,j,k} - r_m|$. (5)

The $r_1 < r_2 < \ldots < r_M$ are fixed knots/changepoints. The coefficients associated with each
knot, $b_{m,k}$, are penalized/shrunk using $N(0, \sigma^2_b)$, with $\sigma^2_b$ given an inverse gamma(1,3) prior as
recommended in Crainiceanu et al. [2000]. The parameters $\beta_{0,k}$ and $\beta_{1,k}$ are given independent
$N(0, \sigma^2_\beta)$ priors, with $1/\sigma^2_\beta$ set to 1e-6. We use a diffuse inverse-gamma prior for $\sigma^2$. The full set
of hyper-parameters will be denoted as $\nu = \{\tau_{1k}, \mu_{1k}, \phi_{1,k}, \sigma^2_{\phi_{1,k}}, \eta_{1,k}, \xi_{1,k}, b_k, \sigma^2_b, \sigma^2, \beta_{0,k}, \beta_{1,k}, \omega\}$.
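The piecewise linear mean function can be written in a few lines. The sketch below assumes the basis at each knot is the absolute deviation from the knot, which is one reading of the extracted equation (5); a truncated linear basis would give the same family of piecewise linear curves under a different parametrization. The name `mean_function` is hypothetical.

```python
def mean_function(t1, beta0, beta1, b, knots):
    """Piecewise linear mean of the clinical effect given the biomarker effect.

    Assumes an absolute-value basis |t1 - r_m| at each fixed knot r_m
    (an assumption about the form of equation (5)).
    """
    if len(b) != len(knots):
        raise ValueError("one coefficient per knot is required")
    return beta0 + beta1 * t1 + sum(bm * abs(t1 - rm) for bm, rm in zip(b, knots))
```

With all knot coefficients equal to zero the function reduces to a straight line, which is why shrinking the coefficients toward zero favours a linear association unless the data indicate otherwise. Under this basis the slope changes by twice the knot coefficient at each knot.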


For the evaluation procedure presented in Section 2.3, we also need the null distribution of $T_2$,
i.e., the distribution of the $T_2$ independent of $T_1$. We do not use the marginal $T_2$ derived from
the above models to avoid any potential model misspecification for the model for $T_2 \mid T_1$. Instead,
we specify the (null) prior for the true clinical treatment effects $T_2$ ignoring all candidate TLGS
information using the same specification and priors as in (5).

To compare candidate trial level surrogates, more than one candidate will need to be available
in a given set of completed trials. When $K$ candidates are available, we use an additive model for
$[T_{2,j} \mid T_{1,j,1}, \ldots, T_{1,j,K}]$: $T_{2,j} = \sum_{k=1}^{K} m_k(T_{1,j,k}, \beta_k, b_k) + \epsilon_j$, where $m_k$ has the same form as (5). The model
containing all information from all $K$ candidates is called the full model.


2.3 Evaluation and Comparison 

Our objective is to predict the true clinical treatment effect in a new setting J + 1 where the 
estimated treatment effect on the clinical outcome, $\hat T_{2,J+1}$, is not observed. For this prediction we
use the posterior distribution of $T_{2,J+1}$, the true treatment effect on the clinical outcome in the
new trial $J + 1$, given the observed data, which includes $\hat T_{1,J+1,k}$ and $\hat\sigma_{1,J+1,k}$ in addition to $O_{J,k}$,
i.e., $T_{2,J+1} \mid O_{J,k}, \hat T_{1,J+1,k}, \hat\sigma_{1,J+1,k}$. As we are interested in comparing and evaluating candidate

TLGS based on their predictive power, we need to determine the quality of the predictions from 
this distribution. 

To evaluate predictive accuracy and compare candidate TLGS, we will use the expected absolute 
prediction error, 


$D_{J+1,k} = E|T_{2,J+1} - T^*_{2,J+1}|$, (6)

where $T^*_{2,J+1}$ is our point prediction of the treatment effect on the outcome in the new setting (trial
$J + 1$). The summary $D_{J+1,k}$ is the expected absolute prediction error for a new trial setting $J + 1$.


The merits and properties of the expected absolute prediction error were outlined in Tian et al.
[2007] for independent and identically distributed data where the true outcome, $T_2$, was observed.
In our setting, the $T_2$ are unlikely to be identically distributed, although we assume that they are
independent, and they are never observed. To estimate $D_{J+1,k}$, we could estimate the error we
make for each observed trial $j$, $d_{j,k}$, by comparing a leave-one-out estimate, $T^*_{2,j}$, to the true $T_{2,j}$ value,
and then average over the $J$ trials,

$D_{J+1,k} \approx (1/J) \sum_{j=1}^{J} d_{j,k} = (1/J) \sum_{j=1}^{J} |T_{2,j} - T^*_{2,j}|$. (7)


A cross-validated estimate of this nature was found to have good finite sample size properties for 
iid data in Tian et al. [2007]. Recall though, that we do not observe the true $T_{2,j}$, so we cannot
directly use (7) either.

Our leave one out prediction of $T_{2,j}$, $T^*_{2,j}$, is computed from the posterior obtained by excluding
$\hat T_{2,j}$ and $\hat\sigma_{2,j}$, as well as all information about all other candidate GS, from the data,

$[T_{2,j} \mid O_{J(-j),k}, \hat T_{1,j,k}, \hat\sigma_{1,j,k}]$,
where $O_{J(-j),k}$ is all information about the clinical outcome and the $k$th candidate GS in all
$\{1, \ldots, J\}$ trials less trial $j$. Since we do not observe the true $T_{2,j}$, even in the evaluation trials, we
will use an approximation to the distribution of the absolute prediction error to assess the TLGS.
In particular, we use

$d_{j,k} \mid O_J \sim |T_{2,j} - T^*_{2,j}|$, (8)

the posterior distribution of the approximate absolute prediction error for trial $j$ using all the data,
$O_J$. Here $T^*_{2,j} = E[T_{2,j} \mid O_{J(-j),k}, \hat T_{1,j,k}, \hat\sigma_{1,j,k}]$. By taking the expected value of $d_{j,k} \mid O_J$, i.e., with respect
to the posterior distribution $T_{2,j} \mid O_J$, we obtain an approximation of $d_{j,k}$ in (7). For evaluation of
a TLGS in an arbitrary new setting, we use the mixture distribution $\hat D_{J+1,k} \sim (1/J) \sum_j d_{j,k}$. The
expectation of $\hat D_{J+1,k}$ is an approximation of (7) and is a single value comparison for candidate
TLGS in a new setting.
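Given posterior draws, the approximation in (8) and the mixture over trials reduce to simple arithmetic. The sketch below assumes that, for each trial, one list of posterior draws of the true clinical effect from the full-data fit and one list from the leave-one-out fit are already available (for example from JAGS output); the function name `loo_prediction_error` and the input layout are illustrative assumptions.

```python
from statistics import mean


def loo_prediction_error(full_posterior, loo_posterior):
    """Pooled draws of the absolute prediction error and their mean.

    full_posterior[j]: posterior draws of the true clinical effect in trial j
                       using all the data.
    loo_posterior[j]:  posterior draws for trial j with that trial's clinical
                       outcome information removed.
    """
    sizes = {len(draws) for draws in full_posterior}
    if len(sizes) != 1:
        # Equal draw counts make simple pooling an equally weighted mixture.
        raise ValueError("each trial must contribute the same number of draws")
    pooled = []
    for full, loo in zip(full_posterior, loo_posterior):
        t2_star = mean(loo)  # leave-one-out point prediction for this trial
        pooled.extend(abs(t - t2_star) for t in full)
    return pooled, mean(pooled)
```

The first returned value is a sample from the mixture distribution over trials, suitable for density plots; the second is the single-number summary used to compare candidates.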

The distribution of $\hat D_{J+1,k}$ is the distribution of interest for comparing candidate TLGS and,
as $\hat D_{J+1,k}$ is on the scale of the true clinical effect, a first step in TLGS evaluation can be a
comparison of $\hat D_{J+1,k}$ to the smallest clinically relevant effect, as was suggested in Baker [2006].
However, to evaluate the absolute quality of a candidate TLGS we will compare $\hat D_{J+1,k}$ to the
absolute prediction error in the absence of $\hat T_{1,j,k}$ or $\hat\sigma_{1,j,k}$; we will denote this as $\hat D_{J+1,0}$. We use the
null model prior for the distribution of the true $T_2$ introduced at the end of Section 2.2. Similar to
the development for $d_{j,k}$, we use the following approximate distribution of the absolute prediction
error,

$d_{j,0} \mid O_J \sim |T_{2,j} - E[T_{2,j} \mid O_{J(-j,-1)}]|$, (9)

where $O_{J(-j,-1)}$ denotes the observed clinical outcome information less that in trial $j$, i.e., $O_{J(-j,-1)} =
\{(\hat T_{2,q}, \hat\sigma_{2,q}) : q \in \{1, \ldots, j - 1, j + 1, \ldots, J\}\}$. The distribution $\hat D_{J+1,0} \sim (1/J) \sum_j d_{j,0}$ is a
good baseline for candidate TLGS evaluation because it quantifies the amount of error that occurs
when no potential TLGS are used in the model.

To evaluate and to compare candidate GS based on $\hat D_{J+1,k}$ and $\hat D_{J+1,0}$, a joint graphical
representation of the density estimates can be useful. Densities of $\hat D_{J+1,k}$ that have more mass
at 0 with shorter tails are evidence of superior candidates, as this implies greater probability of
lower prediction error. Figure 1 is an illustration of how the density plots can be used to compare
and to evaluate candidate trial level general surrogates. Both candidate TLGS in Figure 1 are
superior to the null, $\hat D_{J+1,0}$, while the density of $\hat D_{J+1,l}$ in comparison to that of $\hat D_{J+1,k}$ suggests
that candidate $l$ is superior to candidate $k$; in this example, candidate $l$ is an ideal TLGS, with a
true $D_{J+1,l} = 0$, from using the true $T_{2,j}$ as the estimated treatment effect on the candidate GS.
However, the $k$th candidate is still a very useful TLGS.

The probability $P(\hat D_{J+1,0} < \hat D_{J+1,k})$ can be used to quantify the strength of evidence that a
given candidate TLGS has any value. Small probabilities suggest that there is evidence that the
$k$th candidate has value as a TLGS to aid in the prediction of the clinical treatment effect in a
new trial setting. One can also rank candidates using the set of point and interval estimates from
the set of $\{\hat D_{J+1,1}, \ldots, \hat D_{J+1,K}\}$ distributions. The probability $P(\hat D_{J+1,k} < \hat D_{J+1,l})$ can be used
to quantify the strength of evidence for the superiority of a given candidate over another within
the same set of evaluation trials. Here, small probabilities suggest that there is evidence of the
superiority of the $l$th TLGS candidate over the $k$th candidate, while large probabilities provide
evidence of the opposite.

Figure 1: Density estimates for $\hat D_{J+1,k}$, $\hat D_{J+1,l}$ and $\hat D_{J+1,0}$. Candidate TLGS $k$ is a good, but not
perfect GS. Candidate $l$ is an ideal TLGS, with a true $D_{J+1,l} = 0$. $\hat D_{J+1,0}$ corresponds to the
model containing no candidate GS. Large mass closer to 0, and less spread away from zero, for
$\hat D_{J+1,k}$ and $\hat D_{J+1,l}$, suggests a smaller mean and lower variance of the approximate distribution of
the absolute prediction error.
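The comparison probabilities and the ranking of candidates can be estimated directly from draws of the prediction error distributions. The following minimal sketch assumes the two sets of draws are paired, for instance because both were computed from the same full-data posterior draws of the true clinical effects; that pairing, and the names `prob_less` and `rank_candidates`, are assumptions of this illustration.

```python
from statistics import mean


def prob_less(draws_a, draws_b):
    """Monte Carlo estimate of P(A < B) from paired draws."""
    if len(draws_a) != len(draws_b) or not draws_a:
        raise ValueError("need two non-empty draw lists of equal length")
    return sum(a < b for a, b in zip(draws_a, draws_b)) / len(draws_a)


def rank_candidates(draws_by_candidate):
    """Candidate labels ordered from smallest to largest mean prediction error."""
    return sorted(draws_by_candidate, key=lambda name: mean(draws_by_candidate[name]))
```

For example, passing the null-model draws as the first argument and a candidate's draws as the second gives the probability that the null error is smaller than the candidate's error, so values well below 1/2 would point toward the candidate having value.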


2.4 Posterior computation 

Computation of all quantities of interest requires four steps, which we outline below. 

1. Sample from the posterior of the full model using all observed data to obtain the marginal
posterior of each $T_{2,j}$ in JAGS. This step provides the best estimate of the 'true' value of the
treatment effect on the clinical outcome.

2. Removing the observed clinical data information for trial $j$, we sample from the corresponding
posterior in JAGS to obtain the leave-the-jth-trial-out estimate, $T^*_{2,j}$. We repeat for all $J$
evaluation trials.

3. Using only the clinical information from the other trials besides trial $j$ and removing all
candidate surrogate information, we sample from the corresponding posterior to obtain an
estimate of the true treatment effect on the clinical outcome, the marginal or null
leave-the-jth-trial-out estimate. We repeat this for all $J$ evaluation trials.

4. We use the posteriors of $T_2$ from steps 1 and 2 to obtain the distribution of $\hat D_{J+1,k}$ and the
posterior distributions of $T_2$ from steps 1 and 3 to obtain the distribution of $\hat D_{J+1,0}$.

We use the posterior from step 2 to estimate the true clinical treatment effect in a trial setting 
where only the candidate surrogate is measured. 
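The four computation steps can be organised as a single loop over evaluation trials. The sketch below is a schematic only: `sampler` is a hypothetical callable standing in for the JAGS model fits, and its three-argument signature is an assumption made for this illustration rather than the interface of the supplementary software.

```python
from statistics import mean


def evaluate_candidate(n_trials, sampler):
    """Run the four computation steps with a user-supplied posterior sampler.

    sampler(j, held_out, use_surrogate) must return a list of posterior draws
    of the true clinical effect in trial j (hypothetical interface):
      held_out=None                    -> all data used (step 1)
      held_out=j, use_surrogate=True   -> trial j clinical data removed (step 2)
      held_out=j, use_surrogate=False  -> null model, no surrogate data (step 3)
    """
    d_k, d_0 = [], []
    for j in range(n_trials):
        full = sampler(j, None, True)             # step 1
        t2_star = mean(sampler(j, j, True))       # step 2: point prediction
        t2_null = mean(sampler(j, j, False))      # step 3: null prediction
        d_k.extend(abs(t - t2_star) for t in full)  # step 4, candidate error
        d_0.extend(abs(t - t2_null) for t in full)  # step 4, null error
    return d_k, d_0
```

In practice step 1 needs to be run only once, since the full-data fit yields draws for all trials simultaneously; the per-trial call here is kept for readability.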
3 Simulations


We investigate the performance of our method in several scenarios, including for different relationships
between the true treatment effect on the clinical outcome, $T_{2,j}$, and the true treatment
effect on the surrogate, $T_{1,j,k}$, and for different distributions governing the true $T_{1,j,k}$, including a
single normal or a mixture of normals. For each of these scenarios we investigate the impact of the
magnitude of the errors with which the treatment effects were estimated within the trials, $\hat\sigma_{1,j,k}$
and $\hat\sigma_{2,j}$. For the scenarios where $T_{1,j,k}$ are generated from a normal distribution, we fix the mean
and standard deviation (SD) of each to be 2 and 1, respectively. In the mixture of normals
settings, we generate $T_{1,j,k}$ from an equally weighted mixture of 3 normals with means $(-1, 0, 2)$
and SDs of 1. We fix the SD of the generated $T_{2,j}$ conditional on $T_{1,j,k}$ to be 1.
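A data-generating step of the kind described above might look as follows. The distributions for the true biomarker effects and the conditional SD of 1 follow the text; the linear form of the relationship, the `slope` parameter, and the common standard errors `se1` and `se2` are assumptions for illustration, since the exact scenario specifications are given only in the supplementary Appendix A.

```python
import random


def simulate_trials(n_trials, se1, se2, mixture=False, slope=1.0, rng=None):
    """Generate true and estimated treatment effects for one simulated dataset."""
    rng = rng or random.Random()
    trials = []
    for _ in range(n_trials):
        if mixture:
            # Equally weighted mixture of three normals, each with SD 1.
            centre = rng.choice([-1.0, 0.0, 2.0])
            t1 = rng.gauss(centre, 1.0)
        else:
            t1 = rng.gauss(2.0, 1.0)
        # Assumed linear relationship; conditional SD fixed at 1.
        t2 = slope * t1 + rng.gauss(0.0, 1.0)
        trials.append({
            "t1": t1, "t2": t2,
            "t1_hat": rng.gauss(t1, se1), "se1": se1,
            "t2_hat": rng.gauss(t2, se2), "se2": se2,
        })
    return trials
```

Because the true effects are retained alongside the estimates, the true prediction error can be computed in simulation even though it is never observable in real trials.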

For each scenario we simulate 200 datasets of 20 trials, unless otherwise specified. The exact
specifications of the simulation scenarios are given in Appendix A of the supplementary materials;
the R and JAGS code used to generate the data and fit the models is also available in the
supplementary materials. The last scenario in each table is based on the rotavirus vaccine trial
application, with 12 trials and using the set of $\hat T_1$ and $\hat\sigma_1$ from one of the candidate GS as well as
assuming a strong linear association between the true treatment effects. For our method outlined
in Section 2, we use a piecewise linear model as given in (5) with knots at the 33rd and 66th
percentiles of the $T_2$, for the prior on $T_{2,j} \mid T_{1,j,k}$. We compare our method's predictive accuracy, as
described in Section 2.4, with: 1) a linear regression on the estimated treatment effects (referred to
as 'linear model' in the tables) and 2) the method of Daniels and Hughes [1997] (referred to as DH
in the tables).
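For orientation, the 'linear model' comparator can be sketched as an ordinary least squares fit on the estimated treatment effects, with each trial predicted from a fit that leaves it out. This is one plausible reading of the comparator, not a description of the authors' code; the function names and the use of leave-one-out prediction against the known true effects are assumptions that apply only in simulation.

```python
def ols_fit(x, y):
    """Intercept and slope of a simple least squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation")
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def loo_linear_abs_error(t1_hat, t2_hat, t2_true):
    """Mean absolute leave-one-out prediction error of the linear comparator.

    t1_hat, t2_hat: lists of estimated effects per trial
    t2_true:        list of true clinical effects (known only in simulation)
    """
    errors = []
    for j in range(len(t1_hat)):
        x = t1_hat[:j] + t1_hat[j + 1:]
        y = t2_hat[:j] + t2_hat[j + 1:]
        intercept, slope = ols_fit(x, y)
        errors.append(abs(t2_true[j] - (intercept + slope * t1_hat[j])))
    return sum(errors) / len(errors)
```

Unlike the Bayesian approaches, this comparator ignores the within-trial standard errors, which is one reason its error estimates can be biased when the effects are measured with substantial noise.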


Table 1 reports the mean and SD of $D_{J+1,k} = (1/J) \sum_j |T^{true}_{2,j} - E[T_{2,j} \mid O_{J(-j),k}, \hat T_{1,j,k}, \hat\sigma_{1,j,k}]|$, where
$T^{true}_{2,j}$ is the true clinical treatment effect for trial $j$ and $J$ is the number of trials. Table 1 also
displays the mean and bias of the posterior mean of $\hat D_{J+1,k}$, given in (8), in comparison to $D_{J+1,k}$.
Both the DH method and our proposed method estimate $D_{J+1,k}$ more accurately than the linear
model in all scenarios. The DH method and our proposed method have similar bias, both well
below Monte Carlo error; this suggests relative unbiasedness for both methods. The $D_{J+1,k}$ are
always lower for our proposed method in non-linear scenarios and similar in linear scenarios.

Table 2 reports the average $P(\hat D_{J+1,0} < \hat D_{J+1,k})$ over the 200 simulated datasets for each scenario
and method. The averages are near 1/2 for both the DH method and our proposed method
in the first row of Table 2, the scenario where there is no useful candidate surrogate. In addition,
the lower probabilities in all other scenarios suggest that this probability can help quantify the
value of a candidate as a TLGS. The low probabilities for all scenarios for the linear model are
caused by the biased estimation of $D_{J+1,k}$, which makes $P(\hat D_{J+1,0} < \hat D_{J+1,k})$ under this method
non-informative for evaluating TLGS candidates. Table 2 also presents the average probability for
comparing two candidates, $P(\hat D_{J+1,k} < \hat D_{J+1,l})$, where $\hat D_{J+1,l}$ is the estimate based on a TLGS
with a true $D_{J+1,l} = 0$, from using $T^{true}_{2}$ as the estimated treatment effect on the candidate GS.
The estimates of $P(\hat D_{J+1,l} < \hat D_{J+1,k})$ being larger than 1/2 in almost every case suggests that this
probability can be used to discern between candidate general surrogates of differing quality.

Tables S.2 and S.3 in the supplementary materials summarize the simulations using subject level
data (as opposed to trial level summaries directly) to compare the two different ways of formulating
the within-trial likelihoods. We find that the methods are broadly comparable, with the working
independence version of the likelihood having lower bias in the estimation of $D_{J+1,k}$ by $\hat D_{J+1,k}$,
although the bivariate normal likelihood method tends to have slightly smaller $D_{J+1,k}$ on average.
4 Generalizability


We have outlined a method for evaluating and comparing candidate TLGS. Now we attempt to 
quantify the generalizability of a TLGS to a new setting. Suppose we have a new setting (J + 
1), where only the treatment effect on the TLGS is estimated. We can use our posited method 
to evaluate and compare candidate TLGS in the J completed trials, then estimate the clinical 
treatment effect in this new setting. However, why should we assume that this estimate is valid 
in this new setting? If this new setting is similar enough to the J settings where the TLGS was 
evaluated, we would have more confidence that the relationship between the TLGS and the clinical 
outcome will persist in this new setting. 

Information to support generalizability can be based on the characteristics of the new setting, 
such as the ethnic origin and age range of the trial subjects, relative to the evaluation settings. If 
information on the J evaluation settings and the new setting (J + 1) is available to assess whether the
new setting’s and evaluation settings’ characteristics are similar, it will strengthen generalizability. 
The characteristics that vary over the evaluation trial settings are often referred to as the units 
of variation. Although this is often pointed out in meta-analytic papers, the implications of this 
variation are typically not explored or discussed. 

Here, we suggest guidelines for conveying the information available to support the assumption 
of generalizability. In particular, we propose three ordered classes of generalizability support: 
represented, within range and outside range. If the new trial setting is exactly the same as the
evaluation setting in terms of available characteristics, findings are strongly generalizable. When 
all the characteristics of a new trial are present in at least one of the observed trials, but not all 
in the same one, we call the support of the generalizability assumption represented. When not 
all characteristics of a new setting are represented in an observed trial, but all the characteristics 
are within the range of the observed trials, we call the support for the generalizability assumption 
within range. An example of this third type of support is a new trial with participants between the
ages of 50 and 60 years, when previous trials enrolled 40-50 year old and 60-75 year old subjects. On
the other hand, if a new setting only enrolled 20-30 year old participants, we would call the support 
for the generalizability assumption outside range. Clearly the evidence to support generalization 
would be expected to decline from represented to within range to outside range. 

The reliability of the above classifications hinges on observing a large number of (the same) characteristics
in the evaluation settings and the new study. If there are very few observed characteristics
in any of the J evaluation settings or the new setting, the information to support generalizability is 
limited. The proposed nomenclature is solely a suggestion for succinctly conveying the information 
available to support the assumption of generalizability. 
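The proposed nomenclature can be expressed as a simple classification rule. The sketch below assumes every characteristic is recorded as a numeric interval (a categorical value can be coded as a degenerate interval) and treats "exactly the same" as a single evaluation trial covering every characteristic of the new setting; both conventions, and the function name `classify_support`, are assumptions made for this illustration.

```python
def classify_support(new_setting, evaluation_settings):
    """Classify the support for generalizing to a new setting.

    new_setting:         dict mapping characteristic name -> (low, high)
    evaluation_settings: list of such dicts, one per evaluation trial
    """
    def inside(inner, outer):
        return outer[0] <= inner[0] and inner[1] <= outer[1]

    # One evaluation trial covers every characteristic of the new setting.
    for trial in evaluation_settings:
        if all(name in trial and inside(interval, trial[name])
               for name, interval in new_setting.items()):
            return "strongly generalizable"

    represented = within = True
    for name, interval in new_setting.items():
        observed = [t[name] for t in evaluation_settings if name in t]
        if not observed:
            return "outside range"  # characteristic never observed
        if not any(inside(interval, o) for o in observed):
            represented = False
        hull = (min(o[0] for o in observed), max(o[1] for o in observed))
        if not inside(interval, hull):
            within = False
    if represented:
        return "represented"
    return "within range" if within else "outside range"
```

With evaluation trials that enrolled ages 40-50 and 60-75, a new trial of ages 50-60 is classified as within range and a new trial of ages 20-30 as outside range, matching the examples in the text.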


5 Application: Pentavalent Rotavirus Vaccine 

More than 450,000 children under five years died from complications of rotavirus infection each year
[Tate et al., 2012] prior to vaccine availability. The pentavalent rotavirus vaccine RotaTeq™ (RV5)
developed by Merck has been licensed for use in over 120 countries. The Rotavirus Efficacy and
Safety Trial (REST) against severe rotavirus gastroenteritis [Vesikari et al., 2006] was conducted
in 11 countries. In the substudy of REST in Finland and the United States estimated efficacy was
as high as 98%. However, in other regions in Africa and Asia, lower efficacy has been observed
[Armah et al., 2010, Zaman et al., 2010]. Lower efficacy may be related to differences in participants'
immune system function. If this is the case, such measurements could be used to better predict
efficacy in future settings.

Table 3: Data available for each trial used in the analysis in Section 5.

                                                       Number of Subjects with
Protocol Number   Region/Grouping      Country         Clinical Outcome   IgA    G1
015               Asia                 Bangladesh      1116               146    146
005†              Finland              Finland         1027               644    647
006               Finland              Finland         2324               358    1503
007               Finland              Finland         637                54     54
015               Africa               Ghana           1971               143    143
015               Africa               Kenya           1137               128    128
015               Africa               Mali            1667               137    137
006               Native lands         United States   583                207    207
006               US concomitant       United States   1239               106    104
006               US non-concomitant   United States   366                220    210
007               United States        United States   478                97     98
015               Asia                 Vietnam         871                149    149
006               Asia                 Taiwan          0                  99     99

† Only the placebo and low, middle, and high dose RV5 groups were included.

We investigate several candidate trial level general surrogates using data from four phase II
and III studies of RV5 conducted in seven countries: Finland, the United States, Vietnam, Mali,
Bangladesh, Ghana and Kenya [Vesikari et al., 2006, Heaton et al., 2005, Armah et al., 2010, 2012,
Zaman et al., 2010, Shin et al., 2012]. Participants from Finland and the United States were included
in more than one study. Some of the trials involved more than one country. For the purpose of
this analysis, we treat participants from different countries in the same trial as independent
trials, as it is unlikely that outcomes will be correlated between countries even within the same trial
setting. Table 3 shows the 12 trials (based on study and country) used as our evaluation trials,
and a 13th trial in Taiwan where only the potential TLGSs are measured. The data were provided
by Merck & Co., Inc., Kenilworth, New Jersey, through data sharing agreements with the Fred
Hutchinson Cancer Research Center and the National Institute of Allergy and Infectious Diseases.

For this application we consider rotavirus gastroenteritis of any severity as the outcome of 
interest. Dose, endemic burden of disease, age range, ethnicity, and region were available for each 
trial. Several candidate TLGS were measured in a subset of individuals. These included serum 
anti-rotavirus IgA B-cell responses (IgA), as well as serum neutralization antibody (SNA) to the 
human rotavirus serotypes G1, G2, G3, G4 and P1A.

Given that the different trials collected all the markers on only a random sample of those with the 
clinical outcome, we did not use the method of Dai and Hughes [2012], but instead present results
using the working-independence model (3) based on estimates obtained from fitting independent
generalized linear models in each trial for the available immunogenicity and clinical outcome data. 

Figure 2 displays the point estimates and 95% credible intervals (CI) for the vaccine effect on the
clinical outcome (rate of gastroenteritis of any severity), $T_{2j}$, and the vaccine effect on two selected
immune markers, SNA G1 and serum anti-rotavirus IgA, $T_{1j}$. Due to the limited number of trial
settings, we limited the number of candidates for comparison to the two candidates that seemed
to have the best association with outcome, as seen in Figure 2. Figure S1 in the supplementary
materials describes this same relationship for all the other biomarkers collected. The smooth curve
overlaid in all figures is the posterior mean from the regression model given in (5).
[Figure 2 panels; x-axis in each panel: Estimate $T_1$ and 95% CI]

Figure 2: Association of Estimated Vaccine Effects. Each sub-figure displays the relationship
between the estimated treatment effect on the candidate trial level GS, $T_{1j}$, on the x-axis, and
the estimated treatment effect on the clinical outcome, $T_{2j}$, on the y-axis, over the 12 trial
settings. The two candidates of interest are G1 and IgA. The crossbars depict the 95% CI for the
estimates. The smooth curve overlaid is the posterior mean from the spline model given in (5).


The $T_{2j}$ estimates and standard errors are obtained by fitting Poisson regression models,
including an offset for the log of follow-up time. This is the same model used in the clinical papers
on these trials [Heaton et al., 2005, Vesikari et al., 2006, Armah et al., 2010, 2012, Zaman et al.,
2010, Shin et al., 2012]. The vaccine effect on the clinical outcome in each trial is the difference in
log rates of rotavirus gastroenteritis of any severity between the vaccinated and unvaccinated. A
linear model was used to estimate the vaccine effect on the two potential TLGS, with the effect being
the difference in the log titer level between the vaccinated and unvaccinated participants.
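
The per-trial estimation step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the Poisson model contains only a treatment indicator plus the log follow-up-time offset, in which case the fitted coefficient equals the log ratio of observed event rates and its Wald standard error is the square root of the summed reciprocal event counts. It likewise assumes the linear model for log titers contains only a treatment indicator. The function names and any counts passed to them are hypothetical.

```python
import math
import statistics


def log_rate_ratio(events_vax, time_vax, events_plc, time_plc):
    """Log rate ratio (vaccine vs placebo) and its Wald standard error.

    Matches the treatment coefficient of a Poisson regression with a
    log(follow-up time) offset and no other covariates (an assumption).
    """
    if min(events_vax, events_plc) <= 0:
        raise ValueError("each arm needs at least one event")
    estimate = math.log((events_vax / time_vax) / (events_plc / time_plc))
    std_err = math.sqrt(1.0 / events_vax + 1.0 / events_plc)
    return estimate, std_err


def log_titer_difference(log_titers_vax, log_titers_plc):
    """Difference in mean log titer and its standard error.

    Matches the treatment coefficient of an ordinary linear model with a
    single treatment indicator (pooled residual variance).
    """
    n1, n0 = len(log_titers_vax), len(log_titers_plc)
    diff = statistics.fmean(log_titers_vax) - statistics.fmean(log_titers_plc)
    pooled_var = (
        (n1 - 1) * statistics.variance(log_titers_vax)
        + (n0 - 1) * statistics.variance(log_titers_plc)
    ) / (n1 + n0 - 2)
    return diff, math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n0))
```

Because each quantity is estimated separately within a trial, the two functions can be applied to different subsets of participants, which is what allows the immunogenicity subsample and the clinical outcome cohort to differ.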

For ease of notation let $k \in \{\mathrm{IgA}, \mathrm{G1}\}$. The mean of the predictive error distribution
$D_{J+1,\mathrm{G1}}$ is 0.29, with a 95th percentile of 0.8; for $D_{J+1,\mathrm{IgA}}$ the mean is 0.36, with a 95th
percentile of 1.03. We find evidence that both serum anti-rotavirus IgA and SNA G1 have value as
trial level general surrogates for rotavirus gastroenteritis of any severity in settings where they are
generalizable. This can be seen in Figure 3 and from the probabilities
$P(D_{J+1,0} < D_{J+1,\mathrm{G1}}) = 0.19$ and $P(D_{J+1,0} < D_{J+1,\mathrm{IgA}}) = 0.22$
being less than 0.5. The probability $P(D_{J+1,\mathrm{G1}} < D_{J+1,\mathrm{IgA}}) = 0.46$ suggests the two candidates
are similar in quality. These results can also be seen in Figure 3, as both $D_{J+1,\mathrm{IgA}}$ and $D_{J+1,\mathrm{G1}}$
have more mass closer to 0 and less spread than $D_{J+1,0}$; there is also weak evidence that G1 is a
slightly better TLGS than IgA, as indicated by the probability $P(D_{J+1,\mathrm{G1}} < D_{J+1,\mathrm{IgA}}) = 0.46$ and
by the higher peak of the $D_{J+1,\mathrm{G1}}$ density close to 0.
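
The summaries reported above (mean, 95th percentile, and pairwise probabilities of the predictive error distributions) could be computed from posterior draws along the following lines. This is a sketch under assumptions: it presumes that draws of the absolute prediction error are available for each candidate and are paired across candidates (same posterior iteration), which the text does not state. The function names are hypothetical.

```python
import numpy as np


def summarize_error(draws):
    """Return the mean and 95th percentile of absolute prediction error draws."""
    draws = np.asarray(draws, dtype=float)
    return float(draws.mean()), float(np.percentile(draws, 95))


def prob_smaller(draws_a, draws_b):
    """Monte Carlo estimate of P(D_a < D_b) from paired draws."""
    a = np.asarray(draws_a, dtype=float)
    b = np.asarray(draws_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("draws must be paired and of equal length")
    return float(np.mean(a < b))
```

Under this reading, a probability well below 0.5 that the comparator's error is smaller than a candidate's error favors the candidate, while a value near 0.5 for two candidates indicates similar predictive quality.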


Only selected sites had the clinical outcome measure collected in the REST clinical trial [Vesikari
et al., 2006]. Immune measurements alone were taken at the Taiwan site. We estimate the true
clinical vaccine effect at the Taiwan site, $T_{2,\mathrm{Taiwan}}$ ('Taiwan' corresponds to $J + 1$), to be -1.51
based on the vaccine effect on G1, with 95% CI (-2.3, -0.90); this corresponds to a vaccine efficacy
estimate of 78% with 95% CI (59%, 90%) against rotavirus gastroenteritis of any severity. Given the
95th percentile of the distribution $D_{J+1,\mathrm{G1}}$ and the upper limit of the 95% credible interval
for $T_{2,\mathrm{Taiwan}}$, it is unlikely there would not have been positive efficacy in Taiwan, as (-0.9 + 0.8) < 0.
Figure 3: Density estimates for $D_{J+1,\mathrm{IgA}}$, $D_{J+1,\mathrm{G1}}$ and $D_{J+1,0}$.


Figure S2 of the supplementary materials illustrates this graphically. 
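
The conversion from the predicted log rate ratio to vaccine efficacy for the Taiwan site can be reproduced with a short calculation. The sketch below assumes the usual definition of efficacy as one minus the rate ratio; the function names are hypothetical, and the inputs are the values quoted in the text.

```python
import math


def vaccine_efficacy(log_rate_ratio):
    """Vaccine efficacy implied by a log rate ratio: 1 - exp(log RR)."""
    return 1.0 - math.exp(log_rate_ratio)


def efficacy_interval(estimate, lower, upper):
    """Map a log rate ratio estimate and interval to the efficacy scale.

    A more negative log rate ratio means higher efficacy, so the interval
    bounds swap places after the transformation.
    """
    return (
        vaccine_efficacy(estimate),
        vaccine_efficacy(upper),
        vaccine_efficacy(lower),
    )


ve, ve_low, ve_high = efficacy_interval(-1.51, -2.3, -0.90)
print(round(100 * ve), round(100 * ve_low), round(100 * ve_high))  # 78 59 90

# Conservative check: upper credible limit plus the 95th percentile of the
# predictive error for G1 is still below zero.
print(-0.90 + 0.8 < 0)  # True
```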

The generalizability of SNA G1 as a TLGS to Taiwan can be classified as falling between ‘within range’
and ‘outside of range’; this trial was at the same dose, age range, burden of disease in the population,
and general region, Asia, as previous trials. However, previous trials conducted in Asia were in 
Vietnam and Bangladesh, which are socioeconomically different from Taiwan. 


6 Discussion 


We have provided a definition of a TLGS and outlined a flexible Bayesian framework for the prediction
of clinical treatment effects in a new setting, given the estimated treatment effect on a
candidate TLGS. We also proposed a useful summary for the evaluation and comparison of candidate
TLGS. We demonstrate that our prediction method generally has better predictive properties
than previous methods, particularly when the true relationship between the treatment effects is 
non-linear. We also describe a nomenclature for conveying the evidence to support generalizability 
of a trial level general surrogate. 

In the application, we find evidence of two useful trial level general surrogates for rotavirus
gastroenteritis of any severity in the RV5 vaccine trials; similar findings suggesting serum
anti-rotavirus IgA as a surrogate have been presented by Goveia [2014]. We used the treatment effect
on SNA G1 to predict the clinical efficacy of RV5 against rotavirus gastroenteritis of any severity in
the Taiwan region and found that it is likely that there would have been positive efficacy observed
in this region if the clinical endpoint had been collected. As is demonstrated in the application
and pointed out in Section 2.1.1, our proposed method allowed us to use all available outcome and
immunogenicity data in each trial to estimate the treatment effect on the clinical outcome and
the treatment effect on the candidate GS. Unlike other methods, subjects need not have both the
candidate trial level general surrogate and clinical outcome data in each trial, provided the implicit
missingness is not informative, as was the case in the RV5 trials.

The evaluation and comparison methods we have developed can be used with any flexible
modeling method, as is demonstrated by the DH simulation results. Our proposed Bayesian
non-parametric model could also be used for the full-data model estimation, and other models could be
considered for the leave-the-jth-trial-out predictions, such as the DH model. Useful extensions to
our proposed method would be the simultaneous evaluation of the individual level GS and TLGS,
such as is discussed in Alonso et al. [2015], or the evaluation of surrogate consistency, as discussed
and demonstrated for a specific meta-analytic setting in Elliott et al. [2015]. The consideration of
combinations of measures as surrogates is also of interest for future research.